IN THE CLAIMS:

Please amend the claims as follows. No new matter has been introduced. 

1. (Currently Amended) A program storage device readable by a machine,
tangibly embodying a program of instructions executable by the machine to perform 
method steps for speech synthesis, the method steps comprising: 

determining prosodic parameters of a spoken utterance; 

providing a text string comprising a plurality of words and phonemes and 
corresponding spoken audio signal; 

extracting acoustic feature data from said audio signal; 

aligning the text string and the acoustic feature data and outputting a set of
duration contours indicative of the duration of each word and phoneme;

extracting pitch contour parameters from said audio spoken input;

automatically generating a marked-up text corresponding to the spoken utterance 
using the pitch and duration contours prosodic parameters; and

generating a synthetic waveform using the marked-up text. 

2-3. (Canceled) 

4. (Currently Amended) The program storage device of claim 3, wherein the 
instructions for aligning comprise instructions for segmenting said spoken audio signal 
into time-segmented regions, wherein each time-segmented region is mapped to a 
corresponding phoneme extracting acoustic feature data from the spoken utterance and
time-aligning the spoken input to the corresponding text string using the acoustic feature
data.

5. (Currently Amended) The program storage device of claim 3, wherein the 
alignment is performed using a Viterbi alignment process. 

6. (Canceled)

7. (Currently Amended) The program storage device of claim 1, wherein the
instructions for automatically generating a marked-up text comprise instructions for
directly specifying the pitch contour and duration prosodic parameters as attribute values 
for mark-up elements. 

8. (Original) The program storage device of claim 1, wherein the instructions
for automatically generating a marked-up text comprise instructions for assigning abstract 
labels to the pitch contour and duration prosodic parameters to generate a high-level 
markup. 

9. (Original) The program storage device of claim 1, wherein the marked-up
text is generated using SSML (speech synthesis markup language). 

10. (Original) The program storage device of claim 1, further comprising
instructions for processing phonetic content of the spoken utterance to generate the
synthetic waveform having a desired pronunciation. 

11. (Currently Amended) A method for speech synthesis, comprising the steps of:

determining prosodic parameters of a spoken utterance;

providing a text string comprising a plurality of words and phonemes and
corresponding spoken audio signal; 

extracting acoustic feature data from said audio signal; 

aligning the text string and the acoustic feature data and outputting a set of 
duration contours indicative of the duration of each word and phoneme; 

extracting pitch contour parameters from said audio spoken input;

automatically generating a marked-up text corresponding to the spoken utterance
using the pitch contour and duration parameters; and 

generating a synthetic waveform using the marked-up text. 

12-13. (Canceled) 

14. (Currently Amended) The method of claim 13, wherein aligning comprises 
extracting acoustic feature data from the spoken utterance and time-aligning the spoken 
input to the corresponding text string using the acoustic feature data. 

15. (Currently Amended) The method of claim 13, wherein aligning is 
performed using a Viterbi alignment process. 

16. (Canceled) 

17. (Currently Amended) The method of claim 11, wherein automatically
generating a marked-up text comprises directly specifying the pitch contour and duration 
prosodic parameters as attribute values for mark-up elements. 

18. (Currently Amended) The method of claim 11, wherein automatically
generating a marked-up text comprises assigning abstract labels to the pitch contour and 
duration prosodic parameters to generate a high-level markup. 

19. (Original) The method of claim 11, wherein the marked-up text is
generated using SSML (speech synthesis markup language). 

20. (Original) The method of claim 11, further comprising processing phonetic
content of the spoken utterance to generate the synthetic waveform having a desired 
pronunciation.

21. (Currently Amended) A text-to-speech (TTS) system, comprising:

a prosody analyzer for determining prosodic parameters of a spoken utterance 
corresponding to an input text string and automatically generating a marked-up text 
corresponding to the spoken utterance using the prosodic parameters, wherein the prosody 
analyzer comprises: 

an acoustic feature extraction module that extracts acoustic feature data
from said spoken utterance;

an alignment module for aligning the input text string with the spoken
utterance using said acoustic feature data to generate duration contour information of
elements comprising the input text string;

a pitch contour extraction module for determining pitch contour
information for the spoken utterance; and

a conversion module for including markup in the input text string in
accordance with the duration and pitch contour information to generate the marked up
text; and

a TTS system for generating a synthetic waveform using the marked-up text. 

22. (Original) The system of claim 21, further comprising a user interface that
enables a user to input the spoken utterance and input a text string corresponding to the 
spoken utterance. 

23. (Original) The system of claim 21, wherein the prosody analyzer processes
phonetic content of the spoken utterance to generate the synthetic waveform having a 
desired pronunciation. 

24. (Canceled) 

25. (New) The program storage device of claim 1, wherein extracting acoustic
feature data from said audio signal comprises digitizing the spoken audio signal into a set
of frames and transforming the digitized input waveforms into a set of feature vectors on 
a frame-by-frame basis. 

26. (New) The program storage device of claim 25, wherein transforming the 
digitized input includes producing a 24-dimensional cepstral feature vector for every 10 ms
of the spoken audio signal, concatenating frames to the left and to the right of a current 
frame to augment a current cepstral vector, and reducing each augmented cepstral vector 
to a 60-dimensional feature vector using linear discriminant analysis. 

27. (New) The method of claim 11, wherein extracting acoustic feature data
from said audio signal comprises digitizing the spoken audio signal into a set of frames
and transforming the digitized input waveforms into a set of feature vectors on a
frame-by-frame basis.

28. (New) The method of claim 27, wherein transforming the digitized input 
includes producing a 24-dimensional cepstral feature vector for every 10 ms of the spoken
audio signal, concatenating frames to the left and to the right of a current frame to 
augment a current cepstral vector, and reducing each augmented cepstral vector to a
60-dimensional feature vector using linear discriminant analysis.
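
For illustration only, and not as part of the claims, the frame processing recited in claims 26 and 28 can be sketched roughly as follows. The sketch assumes one 24-dimensional cepstral vector per 10 ms frame, as recited. The context width of four frames on each side, the clamping of edge frames, and the names splice_frames and project are assumptions made here and do not appear in the claims. The projection matrix is a placeholder standing in for a matrix that would in practice be estimated by linear discriminant analysis on labeled training data.

```python
from typing import List

Vector = List[float]

CEPSTRAL_DIM = 24  # per claims 26 and 28
OUTPUT_DIM = 60    # per claims 26 and 28
CONTEXT = 4        # assumption: frames taken on each side of the current frame


def splice_frames(frames: List[Vector], context: int) -> List[Vector]:
    """Augment each frame with `context` frames to its left and right."""
    n = len(frames)
    spliced = []
    for t in range(n):
        augmented: Vector = []
        for offset in range(-context, context + 1):
            # Assumption: repeat the first/last frame at the utterance edges.
            idx = min(max(t + offset, 0), n - 1)
            augmented.extend(frames[idx])
        spliced.append(augmented)
    return spliced


def project(vectors: List[Vector], matrix: List[Vector]) -> List[Vector]:
    """Apply a linear projection; each matrix row yields one output dimension."""
    return [[sum(w * x for w, x in zip(row, v)) for row in matrix] for v in vectors]


# Toy input: 5 frames (50 ms of audio), each filled with its frame index.
frames = [[float(t)] * CEPSTRAL_DIM for t in range(5)]
spliced = splice_frames(frames, CONTEXT)

# Placeholder projection that keeps the first 60 coordinates. A real system
# would substitute a matrix learned by linear discriminant analysis.
placeholder = [
    [1.0 if col == row else 0.0 for col in range(len(spliced[0]))]
    for row in range(OUTPUT_DIM)
]
features = project(spliced, placeholder)

print(len(spliced[0]), len(features[0]))  # 216 60
```

With four frames of context on each side, every augmented vector has 9 x 24 = 216 components before it is reduced to 60 dimensions.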